
Gauss transform is one of several discrete spatial transforms of the form 
%
\beq F(x_j) = \sum_{k=1}^N \kernel f(y_k) \quad \text{at} \quad \{ x_j \, | \, j = 1,...,M \}.  \label{gt} \eeq
%
We refer to the points $ x_j, y_k \in \mathbb{R}^3 $ as the targets and sources respectively. The kernel $G_\delta$ is a
 smooth exponentially decaying function in both the physical and Fourier domains. We shall call such kernels as {\em Gaussian-type}.
  The parameter $\delta$ controls how rapidly the kernel decays.  In the Gauss transform case, $\kernel = e^{-\frac{\norm{x_j - y_k}^2}{\delta}}$. 

Discrete sums of the form (\ref{gt}) are encountered in a variety of disciplines including computational physics,
 machine learning, computational finance and computer graphics \cite{strain94adap, elgammal03, broadie03, kim05, veerapaneni08}. The
 motivation for the present work comes from {\em potential theory} \cite{kress99} applied to  solving linear constant-coefficient
 parabolic partial differential equations (PDEs). Several advantages characterize potential theoretic approaches including reduction
 in dimensionality and exact satisfaction of the far-field conditions. A computational bottle-neck in such approaches is the need to 
 evaluate discrete sums of the form (\ref{gt}). For example, high-order methods for the ``heat potentials'' require evaluating the convolutions \cite{li09, skv09}

% 
\beq F(x) = \int_\Omega \norm{x - y}^{2n} e^{-\frac{\norm{x - y}^2}{\delta}} f(y) \, d\Omega \label{heat} \eeq
% 

where $\Omega$ is the domain and $n$ is a positive integer. Nystr\"{o}m method based on Gaussian quadrature for (\ref{heat}) 
gives rise to discrete sums of the form (\ref{gt}) because the kernel in (\ref{heat}) decays in both physical and Fourier 
domain \cite{fggt}. More examples of Gaussian-type kernels can be found in \cite{victor03}. If the kernel decays rapidly, one 
can simply truncate the sum to a few neighboring sources at each target. However, in many applications including heat potentials, this
 is not the case and computing these sums directly takes $\bigO(NM)$ time. 

{\em Related work.} Because of their fundamental importance, fast schemes for (\ref{gt}) received significant attention in 
the recent past. Starting from the earlier work of Greengard and Strain \cite{fgt}, several sequential
 algorithms \cite{greengard98, sun02, duraiswami03, tausch09, fggt} have been proposed to reduce the cost to an optimal $\bigO(N+M)$. Recently,
  some effort in parallelizing has been made in \cite{rio09} and \cite{yusaku06}; the former in the context of radial basis 
  function (RBF) interpolation using Gaussians and the latter for pricing weather derivatives. The scheme of \cite{rio09} is based on 
  truncating the sum locally and does not generalize to the case where the Gaussian spread is large. In \cite{yusaku06}, only
   smaller ($N, M = \bigO(100)$) one-dimensional problems were considered and they do not report scalability results beyond $n_p = 16$. 
To our knowledge, there have been no parallel implementations to date that compute (\ref{gt}) for highly nonuniform
 distributions and that are scalable in all parameter ranges. 

{\em Contributions.} The main contributions of this work are summarized below.
\begin{itemize} 
\item We describe a novel scheme for the translation of plane wave expansions; this is one of the steps in the
 sequential fast Gausss transform (FGT). Our new scheme reduces the storage and computational costs required for this step compared to previous implementations, especially for highly nonuniform point distributions.
\item We present a parallel version of FGT for uniform distributions incorporating the accelerations introduced in \cite{fggt} for tensor-product grids. We demonstrate the scalability of our algorithm using  upto 120 billion points. To our knowledge, the Gauss transform of such a large number of points has not been computed before.
\item We extend our algorithms to work with nonuniform distributions by using the linear octree data structure. This was motivated by the tree-splitting scheme used in \cite{veerapaneni08} for computing continuous Gauss transforms. The cost of the original FGT algorithm increases as $\delta$ decreases; in contrast, our octree based algorithm scales well in all ranges of the parameters.
\item We also present a parallel implementation of our octree based FGT algorithm.
\end{itemize}

The rest of the paper is organized as follows. In Section \ref{sc:fgt}, we present a short description of the
 standard FGT algorithm and discuss our parallel implementation of the same. We introduce a novel translation scheme 
 in Section \ref{sc:sweep}. This scheme significantly improves the computational and storage costs for 
 nonuniform distributions over the {\em sweeping algorithm} described in \cite{greengard98}. In Section \ref{sc:nonuniform}, we present an octree-based FGT algorithm for nonuniform point distributions; we also discuss
 its parallel implementation. Finally, in Section \ref{sc:results}, we demonstrate the scalability of our algorithms 
 for uniform as well as nonuniform point distributions. In the rest of the paper, we assume the following: (i) the 
 sources and targets coincide, (ii) all points lie within the unit cube $[0, 1]^3$, and (iii) the kernel $G_\delta$ is a Gaussian. These
 assumptions are purely for ease of exposition, all the algorithms are valid in the general case too. 
 
 
\begin{table}[ht!]
\small
\caption{\label{t:notation} Frequently used parameters}
\begin{tabular}{|ll|}
\hline 
FGT parameters &\\
$\delta$                    & bandwidth of the kernel\\
$h$                         & size of an FGT box \\
$\epsilon$                  & user specified precision\\
$p$                         & truncation order of the kernel expansion \\
$K$                         & length of interaction list in each dimension \\
$|B|$                       & number of non-empty FGT boxes \\
$c^B$                       & center of a FGT box B \\
\hline 
Octree parameters &\\
$\ell$                       & a leaf node of the octree \\
$\mathcal{Z}(s)$            & Morton id of point/leaf $s$ \\
$m$                         & upper bound for the number \\ 
                            & of points within any leaf \\
$c$                          & heuristic parameter used to classify \\
                            & leaves as either Expand or Direct \\
$T_d$                       & set of Direct leaves \\
$T_e$                       & set of Expand leaves \\ 
$|N_d|$                     & number of sources/targets in $T_d$ \\ 
$|N_e|$                     & number of sources/targets in $T_e$ \\ 
\hline
\end{tabular}
\end{table}

